Remarks

1. I shortened the abstract to comply with the 150 word limit.

2. In the prior version, I had made the drawings separate. I have now added the words "Replacement
Sheet" to the top of each drawing page. I also annotated the two pages with changed drawings, namely
Figure 2 and Figures 6 and 7.

3. I left in place the change marks provided by Microsoft Word, as we discussed in an earlier telephone 
conversation.

4. I removed the indefinite term "experimentally" from the phrase that describes the determination of the
prefetch distance, and added to the phrasing required to make claim 1 allowable by describing in detail
how linked list semantics are maintained in an equivalent prefetchable data structure.

5. I marked the two cancelled claims "(Cancelled)" rather than deleting them altogether and removed
the ensuing renumbering of following claims. I also marked the unmodified claims "(Original)" if no
changes were made (excepting the undoing of the renumbering).

6. I marked claim 15, to which I had added a colon after "steps of", with "(Currently amended)".

7. I replaced "according to claim 13" with "wherein a tree is constructed as a forest of trees".

8. I added text from claim 8 into independent claim 15, but claim 10 is applicable to linked lists rather
than trees. The tree traversals in this particular context are more or less static (i.e. regular expression
trees that a compiler might traverse after parsing has completed). However, I added sufficient
information to indicate how additions to and deletions from the tree in the spirit of the cancelled claim
10 are accomplished.
A Method for Prefetching Recursive Data Structure Traversals

Field of the Invention 

This invention addresses the problem of prefetching indirect memory references commonly found
in applications employing pointer-based data structures such as trees, linked lists, and graphs. More
specifically, the invention relates to a method for pipelining traversals on these data structures in a way that
makes it possible to employ data prefetching into high speed caches closer to the CPU from slow memory.
It further specifies a means of scheduling prefetch operations on data so as to improve the throughput of the
computer system by overlapping the prefetching of future memory references with the execution of
previously cached data.

Background of the Invention

Modern microprocessors employ multiple levels of memory of varying speeds to reduce the
latency of references to data stored in memory. Memories physically closer to the microprocessor typically
operate at speeds much closer to that of the microprocessor, but are constrained in the amount of data they
can store at any given point in time. Memories further from the processor tend to consist of large dynamic
random access memory (DRAM) that can accommodate a large amount of data and instructions, but
introduce an undesirable latency when the instructions or data cannot be found in the primary, secondary,
or tertiary caches. Prior art has addressed this memory latency problem by prefetching data and/or
instructions into one or more of the cache memories through explicit or implicit prefetch operations.
The prefetch operations do not stall the processor, but allow computation on other data to overlap with the
transfer of the prefetch operand from other levels of the memory hierarchy. Prefetch operations require the
compiler or the programmer to predict with some degree of accuracy which memory locations will be
referenced in the future. For certain mathematical constructs such as arrays and matrices, these memory
locations can be computed a priori. In contrast, the memory reference patterns of the traversals of certain
data structures such as linked lists, trees, and graphs are generally unpredictable because the nodes that
make up the graph are frequently allocated at run time.

In modern transaction processing systems, database servers, operating systems, and other
commercial and engineering applications, information is frequently organized in trees, graphs, and linked
lists. Lack of spatial locality results in a high probability that a miss will be incurred at each cache in the
memory hierarchy. Each cache miss causes the processor to stall while the referenced value is fetched from
lower levels of the memory hierarchy. Because this is likely to be the case for a significant fraction of the
nodes traversed in the data structure, processor utilization will suffer.

The inability to compute the next address to be referenced makes prefetching
difficult in such applications. The invention allows compilers and/or programmers to restructure data
structures and traversals so that pointers are dereferenced in a pipelined manner, thereby making it possible
to schedule prefetch operations in a consistent fashion.

References Cited 

Description of Prior Art 

Multi-threading and multiple context processors have been described in prior art as a means of
hiding memory latency in applications. The context of a thread typically consists of the value of its
registers at a given point in time. The scheduling of threads can occur dynamically or via cycle-by-cycle
interleaving. Neither approach has proven practical in modern microprocessor designs. Their usefulness is
bounded by the context switch time (i.e. the amount of time required to drain the execution pipelines) and
the number of contexts that can be supported in hardware. The higher the miss rate of an application, the
more contexts must be supported in hardware. Similarly, the longer the memory latency, the more work
must be performed by other threads in order to hide memory latency. The more time that expires before a
stalled thread is scheduled to execute again, the greater the likelihood that one of the other threads has
caused a future operand of the stalled thread to be evicted from the cache, thereby increasing the miss
rate, and so creating a vicious cycle.

Non-blocking loads are similar to software controlled prefetch operations, in that the programmer
or compiler attempts to move the register load operation sufficiently far in advance of the first utilization of
said register so as to hide a potential cache miss. Non-blocking loads bind a memory operand to a register
early in the instruction stream. Early binding has the drawback that it is difficult to maintain program
correctness in pointer based codes because loads cannot be moved ahead of a store unless it is certain that
they are to different memory locations. Memory disambiguation is a difficult problem for compilers to
solve, especially in pointer-based codes.

In order to effectively prefetch linked lists, prior art has employed prefetch pointers at each node
of the linked list. Each prefetch pointer is assigned the address of a list element sufficiently far down the
traversal path of the linked list so that a prefetch request may be issued far enough in advance for the
element to arrive in cache before the element is actually reached in the course of the ordinary traversal.
The storage overhead for prefetch pointers is O(N). Furthermore, the data structure cannot be subject to
frequent change, since the cost of maintaining the prefetch pointers can be prohibitive. Another approach
advocated by prior art is embedding the data structure in an array. This removes the O(N) storage overhead
incurred with prefetch pointers, but eliminates the benefits of employing a pointer-based data structure as
well.

Similar to a linked list traversal, traversal of a tree data structure would have to prefetch more
than a single node ahead in the traversal path in order to hide any significant memory latency. In codes
where both the data structure and the traversal path through the data structure remain static over the course
of many traversals, it may be possible to maintain a traversal history pointer at each node, as illustrated in
figure 3. Maintaining the adjunct history pointer adds significant storage space overhead for each of the
pointers. The approach can also incur significant runtime overhead to maintain the history pointers
whenever the data structure is updated because the data structure must be traversed in its entirety in order to
ensure that the correct nodes are prefetched.

Summary of the Invention

The present invention significantly increases the cache hit rates of many important data structure
traversals, and thereby the potential throughput of the computer system and application in which it is
employed. For data structure traversals in which the traversal path may be predetermined, a transformation
is performed on the data structure that permits references to nodes that will be traversed in the future to be
computed sufficiently far in advance to prefetch the data into cache.

For data structure traversals in which the traversal path may be predetermined, the underlying data
structure is given an alternative representation of multiple sub-structures. Thus a linked list is implemented
as a group of linked lists in the following manner: The first element of the linked list is placed at some
predetermined location in the data structure representing the group. The second element is placed at
another location in the group data structure. A function is determined that sequentially yields the address
of the first location of each linked list in the group.¹ A prefetch request is then issued for the
first elements of each of the N lists, where N is sufficiently large so that a prefetch operation can hide the
latency of a cache miss. As each list element in each list is processed, a prefetch request may be issued for
the next element in the list. A separate group of position pointers maintains the position of the traversal of
each of the N lists, and is updated as each node is processed. The next node to be traversed is the node in
the next list (rather than the next element of a given list). Each node indicated by the position pointers is
therefore visited in the order indicated by the aforementioned function. If the function is given by
f(x) = (x + 1) modulo N, and the group of position pointers is represented by an array P, then the position
pointers indicated are traversed in the order P[0], P[1], P[2], ..., P[N-1], P[0], P[1], ... As the list element
pointed to by each position pointer is traversed, each position pointer is updated to point to the next element
of the list.
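
The round-robin visiting order described above can be sketched in C as follows. This is only an illustrative sketch under stated assumptions, not the code of figure 8: the node layout, the name traverse_round_robin, and the PREFETCH macro (mapped here to the GCC/Clang builtin __builtin_prefetch) are hypothetical, and the sketch assumes that element k of the logical list was stored in sublist k modulo N.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout; the text does not prescribe one. */
struct node {
    int value;
    struct node *next;
};

/* Prefetch hint. __builtin_prefetch is a GCC/Clang builtin; on other
 * compilers this sketch falls back to a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)(addr))
#endif

/*
 * Visit the elements in logical list order, assuming element k was
 * stored in sublist k % n. P[0..n-1] are the position pointers; they
 * are advanced as nodes are visited. Visited values are written to
 * out (capacity cap); the number of values written is returned.
 */
size_t traverse_round_robin(struct node **P, size_t n, int *out, size_t cap)
{
    size_t count = 0;
    size_t x = 0;

    assert(P != NULL && out != NULL);

    /* Prologue: request the first element of every sublist. */
    for (size_t i = 0; i < n; i++)
        if (P[i] != NULL)
            PREFETCH(P[i]);

    /* Steady state: the logical list ends at the first exhausted sublist. */
    while (n > 0 && P[x] != NULL && count < cap) {
        struct node *cur = P[x];
        out[count++] = cur->value;   /* process the element */
        P[x] = cur->next;            /* advance this sublist's position */
        if (P[x] != NULL)
            PREFETCH(P[x]);          /* needed again n visits from now */
        x = (x + 1) % n;             /* f(x) = (x + 1) modulo N */
    }
    return count;
}
```

Because each position pointer is revisited only after the other N-1 sublists have had their turn, the prefetch issued when a pointer is advanced has roughly N-1 node visits in which to complete.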

The same method can be applied to general pointer-based data structures. Tree data structures are
frequently used to represent sets, for instance. The invention represents a tree as a group of trees,
henceforth referred to as a forest in this application. Instead of traversing the nodes of a single tree, the
traversals of the trees are conducted in a pipelined fashion. As a node in a given tree is processed, a prefetch
request is issued for the appropriate child that is to be visited next in that subtree. Alternatively, it is
possible to issue a prefetch request for the address indicated in the updated position pointer at position
P[(i+D) modulo F], where i is the current position, D is the number of trees across which a prefetch must
occur in order to hide latency introduced by a cache miss, and F is the number of trees in the forest.

¹ The group of lists can be represented by an array, and the function merely increments the array index by
one, i.e. f(x) = (x + 1) modulo N.

Brief Description of the Drawings

Figure 1 illustrates a linked list according to prior art, with O(N) storage overhead, where N
corresponds to the number of elements in the list.

Figure 2 illustrates a linked list implementation that preserves O(1) push and pop, enqueue and
dequeue operations, yet is prefetchable with only O(1) storage overhead. The list in this example is
constructed of four sublists, S0, S1, S2, and S3. List element 1 can be deleted by assigning sublist header
S3 to the element pointed to by the child pointer of element 1, i.e. element 5. The index of the head of the list,
head_index, is then incremented modulo P, where P is the number of sublists. Similarly, deletion from
the tail decrements the index of the variable indicating the tail element, while the parent of the linked list
element is assigned a child pointer value indicating no further children.

Figure 3 illustrates an implementation of a tree with history pointers with O(N) storage overhead.
The history list is constructed during a separate traversal of the data structure.

Figure 4 illustrates an implementation of a tree data structure that is prefetchable. Multiple
subtrees, in this example T0, T1, and T2, are represented as a group by means of the data structure 6h. In
this example, the group is structured as an array, but any representation of the group is applicable.

Figure 5 illustrates how a tree traversal is modified into a forest traversal.

Figure 6 shows the performance improvement achieved by traversing linked lists with a varying
distance (in bytes) between elements of the linked list, where a distance of zero indicates that the two
linked list elements were adjacent to each other in memory. Linked list elements were of size 8 bytes.

Figure 7 shows the performance improvement achieved by applying prefetching to a post-order
traversal of a tree, with varying prefetch distances (and thus the number of trees) represented on the
horizontal axis. Performance is normalized to the traversal of a traditional linked list of the same length,
and is plotted on the vertical axis.

Figure 8 provides an example of a pipelined linked list traversal with prefetching. The array
elements s[i] maintain the traversal pointers for each of the sublists Si of figure 2. For this example, it is
assumed that the actual work on each element is performed by the subroutine process_element(),
which is assumed to return a value corresponding to the token STOP when a stopping point has been
reached, such as the end of the list or an element that is being searched for, etc. The variable p indicates
the depth of the software pipeline, i.e. the number of cycles required to hide the latency of a memory
reference. The token PREFETCH is used to indicate a prefetch request for the address stored in the
subsequent variable.

Figure 9 is a code fragment that provides an example of a pipelined traversal of a set of trees.

Figure 10 is a code fragment that provides an example of a pipelined level order traversal, which
is used to generate a list of trees across which a pipelined traversal can subsequently be performed.

Detailed Description

Prefetching pointer-based data structures is much more difficult than prefetching data structures
with regular access patterns. In order to prefetch array based data structures, Klaiber and Levy proposed
using software pipelining - a method of issuing a prefetch request during one loop iteration for a memory
operand that would be used in a future iteration. For example, during loop iteration j in which an array
element X[j] is processed, a prefetch request is issued for the operand X[j+d], where d is the number of
loop iterations required to hide the memory latency of a cache miss. The problem with this method of
scheduling prefetch requests, prior to the introduction of this invention, is that it could not be applied to
pointer-based data structures. The invention partitions pointer based data structures into multiple
sub-structures, and then schedules prefetch requests by pipelining accesses across multiple substructures in
a manner similar to that described by Klaiber and Levy. The application of the invention is illustrated on
two important data structures below, linked lists and trees.
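
For orientation, the array case can be sketched in C as follows. This is a hypothetical illustration of software-pipelined prefetching over an array, not code taken from Klaiber and Levy or from the figures: the function name sum_with_prefetch and the PREFETCH macro are assumptions, and the prefetch distance d is taken as given. A production loop would typically issue one prefetch per cache line rather than one per element.

```c
#include <assert.h>
#include <stddef.h>

/* Prefetch hint; __builtin_prefetch is a GCC/Clang builtin, otherwise a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)(addr))
#endif

/*
 * Sum X[0..n-1] while prefetching d iterations ahead. The distance d is
 * machine dependent and assumed to be supplied by the caller.
 */
long sum_with_prefetch(const long *X, size_t n, size_t d)
{
    long total = 0;
    size_t j = 0;
    size_t steady_end = (n > d) ? n - d : 0;

    assert(X != NULL || n == 0);

    /* Prologue: issue prefetch requests only, no processing. */
    for (size_t k = 0; k < d && k < n; k++)
        PREFETCH(&X[k]);

    /* Steady state: process X[j] while requesting X[j + d]. */
    for (; j < steady_end; j++) {
        PREFETCH(&X[j + d]);
        total += X[j];
    }

    /* Epilogue: the remaining elements have already been requested. */
    for (; j < n; j++)
        total += X[j];

    return total;
}
```

The address X[j+d] is computable from the loop index alone, which is exactly what a pointer-based structure lacks until it is partitioned.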

The invention consists of the following method. Step 1 is to create a parallel data structure
consisting of N partitions. Step 1 can be performed by means of transforming an existing data structure
into a parallel data structure, by generating the implementation via a class library or container classes in an
object oriented system, or by a compiler. Step 2 is to pipeline the traversal across the partitions of the
data structure. Step 3 is to determine the prefetch distance required in order to traverse the data structure of
step 1 using the pipelined traversal of step 2. The prefetch distance may be determined experimentally by
the programmer, computed using prior art, or determined by the compiler. Step 4 is to insert prefetch
instructions into the traversal loop body (the steady state loop). The steady state loop may be optionally
preceded by a prologue which performs no data structure traversal, but which does generate prefetch
instructions. The steady state loop may be followed by an epilogue in which no prefetch instructions are
performed, but in which traversal of the data structure continues and possibly completes.

These methods can be illustrated by means of a linked list traversal. Instead of maintaining a
jump pointer as described by Luk and Mowry, the linked list is partitioned into, or constructed as, p
sublists. The list header is augmented to save the index of the last sublist to which an element was added,
as well as the index of the list that contains the current header. An additional state vector s is associated
with the list to maintain the current pointer into each sublist. If the order in which the nodes are appended
to the list is l_0, l_1, ..., l_n, then l_i is added to the end of list i modulo p. If the head of the list resides in
sublist h and is to be deleted, then the value of the list head index, h, is updated to (h+1) modulo p.

A node is added to the head of the list by updating the list head index to (h-1) modulo p and
inserting the node at the head of that list. Assuming a corresponding array of tail pointers, elements can be
inserted and deleted from the tail of the list in a similar fashion. This arrangement makes it possible to
maintain much of the flexibility of linked lists while preserving the traversal order, which may be an
important consideration for managing event queues or other FIFO structures of sufficient size to warrant
prefetching. If traversal order is not a factor, or insertion and deletion from an arbitrary position in the list
must be supported, then the process can be modified to simply contain pointers into the list approximately
the same distance apart.
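
The index bookkeeping described in the two preceding paragraphs can be sketched in C as follows. This is a minimal sketch under stated assumptions rather than the implementation of figure 2: the type and function names (plist, plist_enqueue, plist_dequeue, plist_push_front), the fixed sublist count NSUB, and the length field are hypothetical. Deletion from the tail is omitted because with singly linked sublists it would need access to each tail's predecessor.

```c
#include <assert.h>
#include <stddef.h>

#define NSUB 4   /* number of sublists p; an illustrative choice */

struct node {
    int value;
    struct node *next;
};

struct plist {
    struct node *head[NSUB];  /* first node of each sublist */
    struct node *tail[NSUB];  /* last node of each sublist */
    size_t head_index;        /* sublist holding the logical head */
    size_t tail_index;        /* sublist that receives the next append */
    size_t length;
};

void plist_init(struct plist *q)
{
    for (size_t i = 0; i < NSUB; i++)
        q->head[i] = q->tail[i] = NULL;
    q->head_index = q->tail_index = q->length = 0;
}

/* Append: successive elements land in successive sublists, modulo p. */
void plist_enqueue(struct plist *q, struct node *n)
{
    size_t t = q->tail_index;
    assert(n != NULL);
    n->next = NULL;
    if (q->tail[t] != NULL)
        q->tail[t]->next = n;
    else
        q->head[t] = n;
    q->tail[t] = n;
    q->tail_index = (t + 1) % NSUB;
    q->length++;
}

/* Remove the logical head; the head index becomes (h + 1) modulo p. */
struct node *plist_dequeue(struct plist *q)
{
    if (q->length == 0)
        return NULL;
    size_t h = q->head_index;
    struct node *n = q->head[h];
    q->head[h] = n->next;
    if (q->head[h] == NULL)
        q->tail[h] = NULL;
    q->head_index = (h + 1) % NSUB;
    q->length--;
    return n;
}

/* Insert at the front; the head index becomes (h - 1) modulo p. */
void plist_push_front(struct plist *q, struct node *n)
{
    size_t h = (q->head_index + NSUB - 1) % NSUB;
    assert(n != NULL);
    n->next = q->head[h];
    q->head[h] = n;
    if (q->tail[h] == NULL)
        q->tail[h] = n;
    q->head_index = h;
    q->length++;
}
```

All three operations touch a constant number of pointers, and the header stores two indices plus one head and tail pointer per sublist, independent of the number of elements.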

The code fragment in figure 8 illustrates the software pipelined traversal of a set of sublists. If the
traversal is completed before the end of the list, then any extra prefetch requests represent pure overhead
and unnecessary additional memory traffic. While the invention applies to both uniprocessors and
multiprocessors, even in a uniprocessor, the CPU shares the memory bus with I/O controllers. Since I am
primarily concerned with aggregate throughput, for a significantly long list the cost of these few cases can
be quickly amortized.

The method works well if the amount of work required to traverse from any given node in the data
structure to its successor is small. Preorder traversal of a tree, in contrast, requires work at each node to
determine the next node. The work arises from maintaining the stack and determining whether to follow
the left or the right child pointer. Our approach requires less memory and is more flexible with respect to
insertions and deletions than Luk and Mowry's method. The number of sublists may be larger than the
pipeline depth for any one traversal loop. Thus if the number of sublists is selected to be sufficiently large
to accommodate the largest pipeline depth of any traversal loop that the application is apt to encounter, then
the prefetch distance can still be adjusted to an optimal value.

The sublist method allowed dequeue performance to improve by a factor of 2.85 over an ordinary 
linked list implementation, as illustrated in figure 7. 

The method employed for hiding latency in linked list traversals can also be applied to trees.
There are two operations commonly performed on static trees: performing some operation on the entire tree
and searching a tree for particular nodes. Operations performed on an entire tree are addressed in this
section. Miss rates for the traversal of an entire tree will be high, since there is very little reuse among
cached nodes during the traversal process. Finding a node in tree-structured indices is common to database
applications, and is addressed in a separate, concurrently submitted, patent application.

An alternative approach uses a parallel traversal to accomplish the same goal by maintaining the
state of the parallel traversals. Software pipelining is performed across the parallel traversals, rather than
within a single traversal. In order to facilitate the parallelism, the tree is partitioned into a forest of d trees,
where d is the software pipeline depth required to hide memory latency. This approach trades off runtime
overhead for storage. History pointers require O(N) extra storage, while the software pipelined approach
incurs O(d) extra storage for the state vector and requires O(d log N) storage for maintenance of multiple
stacks. The runtime overhead of the software pipelined approach results from maintaining the state of
multiple parallel traversals.

Software pipelined traversal of a forest of binary trees is illustrated in figure 9. The data structure
in this example does not contain parent pointers. Each tree in the forest is traversed in an in-order fashion,
and software pipelining occurs across the traversals of each tree in the forest in a round-robin fashion.
Software pipelining advances the traversal of each tree in the forest by one node before switching to the
next tree, performing a prefetch for the left or right child when the current node is advanced.

The same approach can be applied to a recursive version. I selected an iterative version to
illustrate this approach because it makes the management of the stack explicit. The prologue code is used
to initialize the state vector s and prefetch the root nodes of each of the trees in the forest. At some point
during the traversal process, one of the traversals will necessarily complete before the others, causing the
variable representing the number of active traversals, p, to be decremented. In order to maintain the state
of active traversals at consecutive locations of s, the state location of a completed traversal is always
replaced by the state of the last active traversal, located at the position indicated by the decremented value
of p.
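
The round-robin forest traversal and the compaction of the state vector can be sketched in C as follows. This is a hypothetical sketch, not the code of figure 9: the names tnode, trav, step, and traverse_forest, the fixed stack bound MAX_DEPTH, and the PREFETCH macro are assumptions, and each turn here advances a traversal by a single pointer step (either a descent to the left or a visit followed by a move to the right).

```c
#include <assert.h>
#include <stddef.h>

/* Prefetch hint; __builtin_prefetch is a GCC/Clang builtin, otherwise a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)(addr))
#endif

#define MAX_DEPTH 64   /* assumed upper bound on tree height */

struct tnode {
    int key;
    struct tnode *left, *right;
};

/* State of one in-order traversal: an explicit stack plus the pending node. */
struct trav {
    struct tnode *cur;
    struct tnode *stack[MAX_DEPTH];
    int top;
};

/* Advance one traversal by a single pointer step.
 * Returns nonzero while the traversal still has work left. */
static int step(struct trav *t, int *out, size_t cap, size_t *count)
{
    if (t->cur != NULL) {                    /* descend left, saving the path */
        assert(t->top < MAX_DEPTH);
        t->stack[t->top++] = t->cur;
        t->cur = t->cur->left;
    } else if (t->top > 0) {                 /* visit a node, then turn right */
        struct tnode *n = t->stack[--t->top];
        if (*count < cap)
            out[(*count)++] = n->key;
        t->cur = n->right;
    }
    if (t->cur != NULL)
        PREFETCH(t->cur);                    /* wanted on this tree's next turn */
    return t->cur != NULL || t->top > 0;
}

/* Traverse f trees in-order, interleaved round-robin. s must provide room
 * for f states. Keys are written to out; the number written is returned. */
size_t traverse_forest(struct tnode **roots, size_t f, struct trav *s,
                       int *out, size_t cap)
{
    size_t p = 0, i = 0, count = 0;

    /* Prologue: initialise the state vector and prefetch the roots. */
    for (size_t k = 0; k < f; k++) {
        if (roots[k] == NULL)
            continue;
        s[p].cur = roots[k];
        s[p].top = 0;
        PREFETCH(roots[k]);
        p++;
    }

    while (p > 0) {
        if (step(&s[i], out, cap, &count))
            i++;                 /* move on to the next active traversal */
        else
            s[i] = s[--p];       /* fill the slot with the last active state */
        if (i >= p)
            i = 0;
    }
    return count;
}
```

Copying a whole state structure on completion keeps the sketch short; an implementation concerned with that cost could keep an array of pointers to the states and swap the pointers instead.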



As a result of prefetching across parallel traversals, there is no epilogue code. As the number of
parallel traversals that are in progress decreases, so does the effective depth of the software pipeline, and
hence the available prefetch distance. In a balanced binary tree, this is not much of a problem because all
traversal requests will complete within a short time of each other. One way to guard against the problem is
to increase the number of trees in the forest, with the hope that a sufficient amount of parallelism will be
available among them for a longer duration. Increasing the depth of the software pipeline brings with it
potential interference from the additional stack space and state representations that the minor decrease in
the depth of the average tree cannot compensate for. Figure 7 shows the effect of varying the number of
trees in a forest of 100 thousand nodes with random keys. A large number of trees is clearly advantageous.

When the traversal order is a requirement, the data structure can follow a similar approach to that
employed for linked lists. For a pre-order traversal, for instance, the tree is built as a forest of trees, where
node n_i is inserted into tree T_(i modulo p), where p is the total number of trees in the forest. A post-order
traversal follows an analogous construction methodology.

It is not always acceptable to maintain a forest instead of a single tree. In those cases where the
traversal order is not important, as when the tree is used to represent a set, the tree can be partitioned by
means of a level-order traversal. The ⌈log d⌉ - 1 nodes closest to the root can be traversed and processed
in a level-order fashion during the prologue. The children of level ⌈log d⌉ - 1 are stored in the state vector
s, and the search commences on this forest of subtrees as before. The prologue that performs this task is
illustrated in figure 10. With each iteration, src_queue contains the nodes of the current level, and
dst_queue contains the nodes of the next level. Once the current level has been processed, the source
and destination queues swap roles; the process is repeated until the appropriate number of levels have been
traversed. Figure 5 illustrates the state of the queues once the root node and its left child have been
processed. Prefetch requests have been issued for the right child of the root node and the two children of
the left child of the root node, which currently occupy the queue. The elements in the queue are the
candidates for root nodes of the subtrees across which pipelined tree traversals can be performed.
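
The level-order prologue can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not the code of figure 10: only the names src_queue and dst_queue come from the text, while level_order_prologue, QCAP, the done buffer, and the PREFETCH macro are hypothetical. The sketch expands whole levels until at least d candidate subtree roots are queued, which for a full binary tree corresponds to the level count given above.

```c
#include <assert.h>
#include <stddef.h>

/* Prefetch hint; __builtin_prefetch is a GCC/Clang builtin, otherwise a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)(addr))
#endif

#define QCAP 64   /* queue capacity; an illustrative choice */

struct tnode {
    int key;
    struct tnode *left, *right;
};

/*
 * Process the tree level by level until at least d subtree roots are
 * available (or the tree is exhausted). Keys of the nodes processed on
 * the way down are appended to done[], which the caller must size for
 * every node above the final frontier. The subtree roots are written to
 * roots[] and their number is returned.
 */
size_t level_order_prologue(struct tnode *root, size_t d,
                            struct tnode *roots[QCAP],
                            int *done, size_t *ndone)
{
    struct tnode *a[QCAP], *b[QCAP];
    struct tnode **src_queue = a, **dst_queue = b, **tmp;
    size_t nsrc = 0, ndst;

    /* A level with fewer than d nodes has at most 2d - 2 children. */
    assert(d > 0 && 2 * d <= QCAP);

    *ndone = 0;
    if (root != NULL)
        src_queue[nsrc++] = root;

    while (nsrc > 0 && nsrc < d) {
        ndst = 0;
        for (size_t i = 0; i < nsrc; i++) {
            struct tnode *n = src_queue[i];
            done[(*ndone)++] = n->key;       /* process the current level */
            if (n->left != NULL) {
                PREFETCH(n->left);
                dst_queue[ndst++] = n->left;
            }
            if (n->right != NULL) {
                PREFETCH(n->right);
                dst_queue[ndst++] = n->right;
            }
        }
        /* The source and destination queues swap roles. */
        tmp = src_queue;
        src_queue = dst_queue;
        dst_queue = tmp;
        nsrc = ndst;
    }

    for (size_t i = 0; i < nsrc; i++)
        roots[i] = src_queue[i];
    return nsrc;
}
```

The returned roots can then seed the state vector of a pipelined forest traversal. For a degenerate tree in which every level holds a single node, the loop consumes the whole tree and returns zero roots.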

Level-order traversal is not generally desirable because of its dynamic storage requirements: the
queue grows by a factor of k in a k-ary tree at each level, eventually reaching a size of n/k. Since the
pipeline depth, represented by PipeDepth in our example, tends to be small, only a small number of
nodes need to be enqueued before a sufficient number of subtrees have been identified to allow effective
software pipelining.

Conclusion 

Having described and illustrated the principles of the invention in a preferred embodiment thereof,
it should be apparent that the invention can be modified in arrangement and detail without departing from
such principles. I claim all modifications and variations coming within the spirit and scope of the
invention.